Population – it’s a very important statistic to measure. And lately, we can see a huge incoming population to Texas, especially Austin. In fact, in 2020, the measured population was 965,827, with the estimated population in 2024 being 984,567, nearly a 2% increase in population.
The data was collection from HERE,
the City of Austin Data Portal. By simply changing the
Calendar Year Issued date to be equals
2023, I was able to retrieve the csv data.
From the data, I will be investigating the following:
The zip codes that appears most frequently
What kinds of work were done often
When were most projects approved
When the projects were completed
During what day of the week most projects take place
Just to name a few…
I obtained the data by downloading the CSV file from the website.
Since the date columns were classified as chr, I had to turn
them into Date columns.
The date columns were as follows: Applied.Data,
Issued.Date, Expires.Date, and
Completed.Date.
I attempted to change the Status.Date column to be of type Date, but was encountering some issues in conversions, so I left it as is. This means that when performing analyses, this column cannot be worked on. I will rather rely on the Status.Current column to analyze the projects.
After the re-factoring of columns, The only columns to have
NA values are the
Completed.Date and
Applied.Date columns. These
NA values are important as the Completed.Date
NA’s signify that the project is yet to be completed or the project was
withdrawn and the Applied.Date NA’s signify that no one
applied for the project but was just rather issued to be done (which
could be the government issuing its own project).
The column names didn’t have to be changed (as the names are already quite descriptive and the only changes would be to remove the .’s from the names).
| Type | Description | |
|---|---|---|
| Permit.Class.Mapped | character | Type of Permit |
| Work.Class | character | Type of Work |
| Applied.Date | Date | Permit Application Date |
| Issued.Date | Date | Permit Issuance Date |
| Day.Issued | character | Day of Week Permit was Issued |
| Calendar.Year.Issued | integer | Calendar Year Permit was Issued |
| Fiscal.Year.Issued | integer | Fiscal Year Permit was Issued |
| Status.Current | character | Current Status of Project |
| Expires.Date | Date | Expiration Date of Project |
| Completed.Date | Date | Completion Date of Project |
| Original.Address.1 | character | Address of Project |
| Original.Zip | integer | Zip Code of Project |
I chose the 4 following variables to analyze:
Work.Class
Day.Issued
Status.Current
Original.Zip
From the graph here, we can see that the types of work with the lowest number of occurrences are Remodel Mobile Home and Life Safety, with Relocation and Addition also having very few occurrences. The most amount of work happens when constructing something New or performing a Remodel.
Seems like Thursday is the most popular day to issue a
building permit, and Sunday is the least popular day. It is
quite interesting to see that the weekends have the lowest number of
occurrences, but it does make sense since the least amount of officials
will be working on the weekends.
Analyzing the graph here, most of the projects that were approved did
go to completion, as indicated by the large differences in occurrences
between Final and the other statuses. A large amount of the
approved permits (projects) remained Active or had
Expired, which is interesting. Even though the permits were
approved, some of the projects remained unfinished. This could be for
several reasons.
1) It could be that the project rolled over and was completed in 2024,
and thus remained active.
2) Some of the projects ended up needing more money or time necessary to
be completed, therefore expiring.
I think the analysis can be done pretty obviously here – just by looking at the graph, it is easy to say that most of the permits for construction were approved for the zip code 78704. 78704 is where Zilker Park, South Lamar, and St. Edward’s University are, making them high-population areas (just south of UT Austin). Because of its proximity to downtown Austin, it makes most sense for most building permits to be approved for that area. The second highest zip code occurrences happens in 78745, which is right below 78704.
I chose the following variable relationships to analyze:
Issued.Date - Applied.Date (Approval Time)
Expires.Date - Issued.Date (Project Length)
Approval Time vs. Project Length (Do longer projects take more approval time?)
Work.Class vs. Project Length (How does the type of project relate to given time?)
Work.Class vs. Day.Issued (How does the type of project relate to the day issued?)
Work.Class vs. Status.Current (How does the type of project relate to it’s completion?)
With the graph above, we can view time it took for the projects to be
approved. Since the approval time for some of the projects took several
years, there was data in the 15000s. This made the boxplot nearly
unusable, and thus I chose to use log for convenient
viewing.
However, we do need the real data, and so the summary of the data is
below:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -183.0 13.0 61.0 192.7 194.0 15048.0 1
Notice how there’s a negative approval time? When I looked at the data the portal gave me from the website, it has it so that the issuing date is before the applied date, which makes no sense. So I dug a little deeper (going HERE), and doing a quick search of the address giving me the negative approval time gave me the correct information. The applied date was correct, but the issuing date was indeed wrong (which meant the day issued was also wrong). To fix this, I have manually change this.
With this, let’s revisit the distribution and summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -183.0 13.0 61.0 192.7 194.0 15048.0 1
And there we go, the summary now has a minimum of 0 days. It’s
interesting to see that 75% of the approvals
(as seen from the 3rd Quartile) takes less than 194 days, but there are
those which take up to more than 15,000 days, which is more than 41
years!
Another interesting piece of data is the NA’s. For some reason, there is
an address (if you want to know, 804 PARK BLVD) which had an Issued.Date
but an NA for the Applied.Date. This is why we are getting the NA values
in the summary.
Similar to the last one, let’s view the summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -14857.0 52.0 138.0 149.4 245.0 667.0 4276
Once again, we are getting the negative values.
But before we attack the negatives, there’s a column from the summary
labelled NA's, which is interesting. This is derived due to
some of the projects (in this case, 4276) not having a completed date
(thus set to NA) but having an issued date. When subtracting values from
NA, the result ends up being, you guessed it, NA. The projects with
statuses Active, Expired, and VOID all had NA completed dates, with each
having 2561, 1669, and 33, respectively. These total up to 4263, because
Withdrawn, Pending Permit, and On Hold all have their amount of NA’s
(which total up to 13), giving us the 4276.
Let’s filter the negative and NA values out and view the graph and
data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 56.0 141.0 165.7 246.0 667.0
Now we can visualize that the average time it took for a project to be completed was 165 (nearly 166) days, with the maximum being 667 days. More than 75% of the day differences are below 246 days.
For the sake of repetition, I will mutate the dataframe to have the above two as additional columns.
Wait, the data looks awkward, right? How were some projects completed
before they were even issued (as seen with the negative Project
Lengths)? Digging some more into this, I went back to the record (the
link I had to replace the error before), and it has the same
information. To not spread misinformation, I will filter out the
negative lengths and view the graph that way.
Looking at the graph, most of the cluster is below an Approval Time of 2500 days. But the question is how do these two relate? From previous analyses, we can know that the more horizontal/vertical a graph is, the less correlation the two variables are, and the graphs with a slope have a correlation. This graph is both horizontal and vertical, so let’s see how this turns out.
## [1] -0.04
Well, this graph shows a -0.04 correlation, which is a Weak Negative relationship. Anything with no correlation is bad enough, but this is worse! These two variables move in opposite directions - the longer the project length, the shorter the approval time (essentially, the graph is downward sloping rather than upward sloping). Thing is, it would be vital if the correlation was more negative than it is, but with -0.04, that’s basically nothing. In the end, it just means that these two variables have very little effect on each other, but the general trend is somewhat (very, very slightly) negative sloping.
The type of work that had the largest median project length was
Shell. Shell is a construction term used to
describe the exterior and the core of the building (which would be
walls, roofing, foundation, windows, etc.). In contrast,
Life Safety had the lowest median project length.
Addition and Remodel had the largest project length (which
would be 667, as observed from the Project Lengths
analysis). New had the second largest max, with
Repair coming in third and Remodel coming in
fourth. Interestingly, Life Safety, alongside having the
lowest median project length (and maximum project length for that
matter), also has the lowest number of occurrences. New has
the most amount of occurrences.
For the weekdays (M, T, W, Th, F), we can see that the
New category leads in occurrences. On the weekends (Sat,
Sun), due to the lower quantity of work being done, it’s harder to see
which one really takes the lead, but on a high level, New
takes the lead for Saturday and Addition and Remodel takes
the lead on Sunday.
One patterns of types of work is similar on Mondays, Tuesdays, and
Fridays while another is similar on Wednesdays and Thursdays.
The analysis we can perform here is simple: more work is assigned on
weekdays than weekends. Since the pattern is quite similar for all
weekdays, there would be no correlation between the type of work done
and the day the project is issued on.
Here, we can see that a majority of the projects did end up going to
completion, but even if the New column had the most amount
of completed projects, it also had the largest amount of active
projects. Repair had the largest amount of expired
projects, and Remodel had the largest amount of withdrawn
projects.
Seeing as how interesting it was to analyze the graphs before, let’s
try to create a MULTIVARIATE plot, which has 3 variables compared
against each other.
Above is the relationship between three variables: Work.Class, Status.Current, and Day.Issued. As seen, a majority of the projects were finalized, but a large proportion of them for each day remained active. Here, we are basically looking at the same data as the sixth analysis (Type of Work vs. Completion Status), but now we’ve split them up into days. Nevertheless, the pattern remains the same: the weekdays get a majority of the projects, with New being the highest for them all and the patterns of the data heights matching with other days.
Link to GitHub -> {GitHub}
The repository contains the Project2.Rmd,
Project2.html, and
Issued_Construction_permits_20241102.csv files.
The Project2.Rmd file contains the code for the
project.
The Project2.html file contains the html view of the
base project. This file will not be modified until the code in the Rmd
is changed and the file is knit.
The Issued_Construction_permits_20241102.csv file is
the csv that contains the data. This is directly what the permits
dataframe reads into.
The choice elements I have are A, D, E, G, and I.
A.1 - Line 238 - Getting the count of
Active occurrences in Status.Current
A.2 - Line 238 - Getting the count of
Expired occurrences in Status.Current
A.3 - Line 238 - Getting the count of VOID
occurrences in Status.Current
A.4 - Line 294 - Getting the correlation between
Project Length and Approval Time
D.1 - Line 258 - Mutated the dataframe with three
new columns: IssuedDayDiff, ExpiredDayDiff,
and ProjectLength
E.1 - Line 222 - Explaining how the NA values in the
Approval Time Lengths came from an NA value in the
Applied.Date column
E.2 - Line 238 - Explaining how the NA values in the
Project Lengths were a sum of Active, Expired,
VOID, and other statuses
G.1 - Line 052 - Converting the
Applied.Date column to be of type Date, not character
G.2 - Line 053 - Converting the
Issued.Date column to be of type Date, not character
G.3 - Line 054 - Converting the
Expires.Date column to be of type Date, not character
G.4 - Line 055 - Converting the
Completed.Date column to be of type Date, not character
I.1 - Line 347 - The multivariate plot containing
the relationship between Work.Class,
Status.Current, and Day.Issued
Throughout the project, we have done several analyses. And we found a lot of information too. For starters, we know the government has a lot of wrong information in their datasets, which is just hilarious. But jokes aside, we did gain some vital findings related to my prompt. A majority of the building permits issued in 2023 were New and Shell construction, which means more houses and more apartments were being built to house a larger number of incoming people. Most of the projects that were issued were also applied for in 2023, which goes to show the efficiency the construction workers had to be working at in order for the project to be completed. We found that while there was nearly no relationship between the time length of a project and the time it took for the project to be approved (issued), there is somewhat of a relationship between the type of project the work is and the time length of said project, which makes sense considering building a new home would take longer than adding life safety features to buildings. We found that the Zip Codes which had a majority of the projects were in downtown, indicating large amounts of growth within the city and the nearby suburban areas. While this project didn’t allow me to truly see how the population varied throughout the year, it provided key information to proving that the incoming population was large.
https://i.redd.it/dh75zlho59o71.jpg
Image used for Project
https://data.austintexas.gov/Building-and-Development/Building-Permits-Issued-in-2015/ayav-qckq
Data collection portal
https://data.austintexas.gov/Building-and-Development/Issued-Construction-Permits/3syk-w9eu/data_preview
Larger data portal, with information from all years (backup source for
information correction)